Course: Machine Learning and Data Science for Social Good (20S856137)
Authors: Boqin Cai (boqin.cai@stud.sbg.ac.at)
Credit scorecards are a common risk-control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowing, so that the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk.
In this notebook, I use Python to analyze the risk of credit card customers based on historical data. I model the data with a decision tree and a random forest, and finally use grid search cross-validation to optimize the parameters of the random forest.
# Imports used throughout the notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling

application_record_df = pd.read_csv('426827_1031720_bundle_archive/application_record.csv')
credit_record_df = pd.read_csv('426827_1031720_bundle_archive/credit_record.csv')
application_record_df
| | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 |
| 1 | 5008805 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 |
| 2 | 5008806 | M | Y | Y | 0 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | -21474 | -1134 | 1 | 0 | 0 | 0 | Security staff | 2.0 |
| 3 | 5008808 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | -19110 | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 |
| 4 | 5008809 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | -19110 | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 438552 | 6840104 | M | N | Y | 0 | 135000.0 | Pensioner | Secondary / secondary special | Separated | House / apartment | -22717 | 365243 | 1 | 0 | 0 | 0 | NaN | 1.0 |
| 438553 | 6840222 | F | N | N | 0 | 103500.0 | Working | Secondary / secondary special | Single / not married | House / apartment | -15939 | -3007 | 1 | 0 | 0 | 0 | Laborers | 1.0 |
| 438554 | 6841878 | F | N | N | 0 | 54000.0 | Commercial associate | Higher education | Single / not married | With parents | -8169 | -372 | 1 | 1 | 0 | 0 | Sales staff | 1.0 |
| 438555 | 6842765 | F | N | Y | 0 | 72000.0 | Pensioner | Secondary / secondary special | Married | House / apartment | -21673 | 365243 | 1 | 0 | 0 | 0 | NaN | 2.0 |
| 438556 | 6842885 | F | N | Y | 0 | 121500.0 | Working | Secondary / secondary special | Married | House / apartment | -18858 | -1201 | 1 | 0 | 1 | 0 | Sales staff | 2.0 |
438557 rows × 18 columns
credit_record_df
| | ID | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5001711 | 0 | X |
| 1 | 5001711 | -1 | 0 |
| 2 | 5001711 | -2 | 0 |
| 3 | 5001711 | -3 | 0 |
| 4 | 5001712 | 0 | C |
| ... | ... | ... | ... |
| 1048570 | 5150487 | -25 | C |
| 1048571 | 5150487 | -26 | C |
| 1048572 | 5150487 | -27 | C |
| 1048573 | 5150487 | -28 | C |
| 1048574 | 5150487 | -29 | C |
1048575 rows × 3 columns
print(f"The shape of application_record_df {application_record_df.shape}")
print(f"The shape of credit_record_df {credit_record_df.shape}")
print(f"The number of distinct IDs of application_record_df {len(set(application_record_df['ID']))}")
print(f"The number of distinct IDs of credit_record_df {len(set(credit_record_df['ID']))}")
The shape of application_record_df (438557, 18)
The shape of credit_record_df (1048575, 3)
The number of distinct IDs of application_record_df 438510
The number of distinct IDs of credit_record_df 45985
print('Missing value of application_record_df')
print(application_record_df.isna().any())
print('Missing value of credit_record_df')
print(credit_record_df.isna().any())
Missing value of application_record_df
ID                     False
CODE_GENDER            False
FLAG_OWN_CAR           False
FLAG_OWN_REALTY        False
CNT_CHILDREN           False
AMT_INCOME_TOTAL       False
NAME_INCOME_TYPE       False
NAME_EDUCATION_TYPE    False
NAME_FAMILY_STATUS     False
NAME_HOUSING_TYPE      False
DAYS_BIRTH             False
DAYS_EMPLOYED          False
FLAG_MOBIL             False
FLAG_WORK_PHONE        False
FLAG_PHONE             False
FLAG_EMAIL             False
OCCUPATION_TYPE        True
CNT_FAM_MEMBERS        False
dtype: bool
Missing value of credit_record_df
ID                False
MONTHS_BALANCE    False
STATUS            False
dtype: bool
This is a sample of a repeated ID. The ID field should be unique, but for ID 7052783 there are two different rows in the dataset, so we cannot be sure which one is correct. This would cause problems when merging the datasets. There are about 30 repeated IDs in total, so we simply drop them all.
application_record_df[application_record_df['ID']==7052783]
| | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 421726 | 7052783 | M | Y | Y | 0 | 157500.0 | Working | Higher education | Married | House / apartment | -13428 | -2589 | 1 | 0 | 1 | 0 | Laborers | 2.0 |
| 422660 | 7052783 | M | Y | Y | 2 | 166500.0 | Working | Secondary / secondary special | Married | House / apartment | -15883 | -2697 | 1 | 1 | 0 | 1 | Managers | 4.0 |
application_record_df=application_record_df.drop_duplicates(subset='ID', keep=False)
If a customer has no loan or has paid it off, the month is marked 'X' or 'C' in the STATUS column, so I convert both to -1. The strategy for defining a risky customer is: anyone who was ever more than 30 days overdue on a bill is flagged. So, aggregating all credit records per customer, I mark normal customers as 0 and risky customers as 1.
# 'X' (no loan) and 'C' (paid off) count as not overdue: map them to -1,
# then cast STATUS to int so it can be aggregated numerically
credit_record_df['STATUS'] = credit_record_df['STATUS'].replace({'X': -1, 'C': -1}).astype(int)
# worst (maximum) status per customer; anything above 0 means overdue more than 30 days
grouped = credit_record_df.groupby('ID').max()
grouped.loc[grouped['STATUS'] > 0, 'STATUS'] = 1
sns.countplot(x=grouped['STATUS'])
<matplotlib.axes._subplots.AxesSubplot at 0x7faa7e75ec10>
Link the two datasets on ID.
# groupby('ID') moved ID into the index, so restore it as a column before merging
df = application_record_df.merge(grouped.reset_index(), how='inner', on='ID')
df
| | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | MONTHS_BALANCE | STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 | -13 | 1 |
| 1 | 5008805 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 | -12 | 1 |
| 2 | 5008806 | M | Y | Y | 0 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | -21474 | -1134 | 1 | 0 | 0 | 0 | Security staff | 2.0 | -8 | 0 |
| 3 | 5008808 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | -19110 | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 | 0 | 0 |
| 4 | 5008810 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | -19110 | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 | -16 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31997 | 5149828 | M | Y | Y | 0 | 315000.0 | Working | Secondary / secondary special | Married | House / apartment | -17348 | -2420 | 1 | 0 | 0 | 0 | Managers | 2.0 | 0 | 1 |
| 31998 | 5149834 | F | N | Y | 0 | 157500.0 | Commercial associate | Higher education | Married | House / apartment | -12387 | -1325 | 1 | 0 | 1 | 1 | Medicine staff | 2.0 | -5 | 1 |
| 31999 | 5149838 | F | N | Y | 0 | 157500.0 | Pensioner | Higher education | Married | House / apartment | -12387 | -1325 | 1 | 0 | 1 | 1 | Medicine staff | 2.0 | -14 | 1 |
| 32000 | 5150049 | F | N | Y | 0 | 283500.0 | Working | Secondary / secondary special | Married | House / apartment | -17958 | -655 | 1 | 0 | 0 | 0 | Sales staff | 2.0 | 0 | 1 |
| 32001 | 5150337 | M | N | Y | 0 | 112500.0 | Working | Secondary / secondary special | Single / not married | Rented apartment | -9188 | -1193 | 1 | 0 | 0 | 0 | Laborers | 1.0 | 0 | 1 |
32002 rows × 20 columns
report=pandas_profiling.ProfileReport(df)
report
Summarize dataset: 100%|██████████| 34/34 [00:13<00:00,  2.48it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.71s/it]
Render HTML: 100%|██████████| 1/1 [00:01<00:00,  1.90s/it]
CODE_GENDER, FLAG_OWN_CAR, and FLAG_OWN_REALTY use strings to represent the meaning of the field, but the machine learning models in sklearn cannot work with string labels. So these three fields are mapped to 0 and 1.
CNT_CHILDREN, AMT_INCOME_TOTAL, DAYS_BIRTH, DAYS_EMPLOYED, and CNT_FAM_MEMBERS are continuous variables. Here they are all converted to categories.
NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE, and OCCUPATION_TYPE are multi-label columns, so I use LabelEncoder to encode the labels. OCCUPATION_TYPE also has missing values, which I replace with 'Other'.
The column FLAG_MOBIL only contains 1, so it is a constant and is removed.
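The encoding cell itself is not shown in the notebook; a minimal sketch of the steps described above could look like the following. The toy column values and the bin edges for the continuous field are assumptions for illustration, not the notebook's actual choices.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy frame standing in for the merged df (values are illustrative)
df = pd.DataFrame({
    'CODE_GENDER': ['M', 'F'],
    'FLAG_OWN_CAR': ['Y', 'N'],
    'FLAG_OWN_REALTY': ['Y', 'Y'],
    'AMT_INCOME_TOTAL': [427500.0, 112500.0],
    'OCCUPATION_TYPE': [None, 'Security staff'],
    'FLAG_MOBIL': [1, 1],
})

# 1) binary string fields -> 0/1
for col in ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']:
    df[col] = df[col].map({'M': 1, 'F': 0, 'Y': 1, 'N': 0})

# 2) continuous field -> categories (bin edges here are assumed)
df['AMT_INCOME_TOTAL'] = pd.cut(df['AMT_INCOME_TOTAL'],
                                bins=[0, 150000, 300000, float('inf')],
                                labels=False)

# 3) multi-label field: fill missing values with 'Other', then label-encode
df['OCCUPATION_TYPE'] = LabelEncoder().fit_transform(
    df['OCCUPATION_TYPE'].fillna('Other'))

# 4) drop the constant column
df = df.drop(columns=['FLAG_MOBIL'])
```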
df
| | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5008804 | 1 | 1 | 1 | 0 | 2.0 | 4 | 1 | 0 | 4 | 1 | 2.0 | 1 | 0 | 0 | 12 | 2.0 | 1 |
| 1 | 5008805 | 1 | 1 | 1 | 0 | 2.0 | 4 | 1 | 0 | 4 | 1 | 2.0 | 1 | 0 | 0 | 12 | 2.0 | 1 |
| 2 | 5008806 | 1 | 1 | 1 | 0 | 0.0 | 4 | 4 | 1 | 1 | 3 | 0.0 | 0 | 0 | 0 | 17 | 2.0 | 0 |
| 3 | 5008808 | 0 | 0 | 1 | 0 | 1.0 | 0 | 4 | 3 | 1 | 3 | 1.0 | 0 | 1 | 1 | 15 | 1.0 | 0 |
| 4 | 5008810 | 0 | 0 | 1 | 0 | 1.0 | 0 | 4 | 3 | 1 | 3 | 1.0 | 0 | 1 | 1 | 15 | 1.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31997 | 5149828 | 1 | 1 | 1 | 0 | 1.0 | 4 | 4 | 1 | 1 | 2 | 1.0 | 0 | 0 | 0 | 10 | 2.0 | 1 |
| 31998 | 5149834 | 0 | 0 | 1 | 0 | 0.0 | 0 | 1 | 1 | 1 | 1 | 0.0 | 0 | 1 | 1 | 11 | 2.0 | 1 |
| 31999 | 5149838 | 0 | 0 | 1 | 0 | 0.0 | 1 | 1 | 1 | 1 | 1 | 0.0 | 0 | 1 | 1 | 11 | 2.0 | 1 |
| 32000 | 5150049 | 0 | 0 | 1 | 0 | 1.0 | 4 | 4 | 1 | 1 | 2 | 0.0 | 0 | 0 | 0 | 15 | 2.0 | 1 |
| 32001 | 5150337 | 1 | 0 | 1 | 0 | 0.0 | 4 | 4 | 3 | 4 | 0 | 0.0 | 0 | 0 | 0 | 8 | 1.0 | 1 |
32002 rows × 18 columns
Usually, we split the dataset into two parts for model testing. Empirically, 70% of the data is used for training and 30% for testing. STATUS is the dependent variable, and there are 16 independent variables.
from sklearn.model_selection import train_test_split
X=df.iloc[:,1:-1]
y=df['STATUS']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=100)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
model=DecisionTreeClassifier(max_depth=20, min_samples_leaf=30)
model.fit(X_train,y_train)
y_predict_train=model.predict(X_train)
y_predict=model.predict(X_test)
print('Accuracy of training dataset: ', accuracy_score(y_train, y_predict_train))
print('Accuracy of test dataset: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of training dataset: \n', confusion_matrix(y_train,y_predict_train))
print('Confusion matrix of test dataset: \n', confusion_matrix(y_test,y_predict))
ax=sns.heatmap(confusion_matrix(y_train,y_predict_train),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of training dataset')
plt.show()
ax=sns.heatmap(confusion_matrix(y_test,y_predict),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of test dataset')
plt.show()
Accuracy of training dataset:  0.8660327663943574
Accuracy of test dataset:  0.8651182168524112
Confusion matrix of training dataset: 
 [[19383    14]
 [ 2987    17]]
Confusion matrix of test dataset: 
 [[8304   10]
 [1285    2]]
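Accuracy alone is misleading here: the confusion matrices show that almost all risky customers (class 1) are misclassified. A quick sanity check of the minority-class recall, computed directly from the test confusion matrix above:

```python
# test confusion matrix from the run above: rows = true class, columns = predicted class
tn, fp = 8304, 10   # true class 0
fn, tp = 1285, 2    # true class 1

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall_risky = tp / (tp + fn)  # fraction of risky customers actually caught

print(f"accuracy     = {accuracy:.3f}")       # ~0.865, looks good
print(f"risky recall = {recall_risky:.4f}")   # ~0.0016, almost no risky customer detected
```

This imbalance problem is addressed with SMOTE later in the notebook.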
If we draw the decision tree, we find something strange. OCCUPATION_TYPE indicates the type of job, and the original data is categorical. Because the machine learning models in sklearn only accept numbers as input, we encoded it, but the resulting numbers have no specific meaning and are not comparable. The decision tree nevertheless treats them as ordinary numbers and splits nodes on numeric ranges.
from sklearn import tree
import graphviz
dot_tree=tree.export_graphviz(model, feature_names = X_train.columns, max_depth=2, filled=True)
graph = graphviz.Source(dot_tree)
graph
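To see the problem concretely, consider a toy example (the occupation values here are illustrative). LabelEncoder assigns integers alphabetically, so a split like `OCCUPATION_TYPE <= 1.5` groups occupations that have no meaningful order, whereas one-hot columns let a tree test each occupation independently:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

jobs = pd.Series(['Managers', 'Laborers', 'Sales staff', 'Laborers'])

# label encoding imposes an arbitrary (alphabetical) order:
# 'Laborers' < 'Managers' < 'Sales staff' numerically
encoded = LabelEncoder().fit_transform(jobs)
print(encoded)  # [1 0 2 0]

# one-hot encoding produces one 0/1 column per occupation instead
print(pd.get_dummies(jobs).astype(int))
```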
To solve this problem, I use one-hot encoding to re-encode the dataset. One-hot encoding expands the labels of a column into separate columns containing only 0 and 1.
dummy_columns=['CNT_CHILDREN', 'AMT_INCOME_TOTAL','NAME_INCOME_TYPE',
'NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS','NAME_HOUSING_TYPE',
'DAYS_BIRTH','DAYS_EMPLOYED','OCCUPATION_TYPE','CNT_FAM_MEMBERS']
dummy_X=pd.get_dummies(X, columns=dummy_columns)
dummy_X
| | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | CNT_CHILDREN_0 | CNT_CHILDREN_1 | CNT_CHILDREN_2 | AMT_INCOME_TOTAL_0.0 | ... | OCCUPATION_TYPE_12 | OCCUPATION_TYPE_13 | OCCUPATION_TYPE_14 | OCCUPATION_TYPE_15 | OCCUPATION_TYPE_16 | OCCUPATION_TYPE_17 | OCCUPATION_TYPE_18 | CNT_FAM_MEMBERS_1.0 | CNT_FAM_MEMBERS_2.0 | CNT_FAM_MEMBERS_3.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31997 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 31998 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 31999 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 32000 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 32001 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
32002 rows × 62 columns
# re-split using the one-hot encoded features, then retrain
# (otherwise the model would still be fit on the label-encoded X_train)
X_train, X_test, y_train, y_test = train_test_split(dummy_X, y, stratify=y, test_size=0.3, random_state=100)
model = DecisionTreeClassifier(max_depth=20, min_samples_leaf=30)
model.fit(X_train, y_train)
y_predict_train=model.predict(X_train)
y_predict=model.predict(X_test)
print('Accuracy of training dataset: ', accuracy_score(y_train, y_predict_train))
print('Accuracy of test dataset: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of training dataset: \n', confusion_matrix(y_train,y_predict_train))
print('Confusion matrix of test dataset: \n', confusion_matrix(y_test,y_predict))
ax=sns.heatmap(confusion_matrix(y_train,y_predict_train),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of training dataset')
plt.show()
ax=sns.heatmap(confusion_matrix(y_test,y_predict),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of test dataset')
plt.show()
Accuracy of training dataset:  0.866211329851346
Accuracy of test dataset:  0.8649099052182064
Confusion matrix of training dataset: 
 [[19381    16]
 [ 2981    23]]
Confusion matrix of test dataset: 
 [[8296   18]
 [1279    8]]
The structure of the decision tree
from sklearn import tree
import graphviz
dot_tree=tree.export_graphviz(model,feature_names = dummy_X.columns, max_depth=2,
filled=True)
graph = graphviz.Source(dot_tree)
graph
But we still have another problem: the dataset is imbalanced. Only 4291 customers are marked 1, versus 27711 marked 0. We need to balance the dataset to improve the model's generalization. Here I use the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the minority class.
from imblearn.over_sampling import SMOTE
print('Before oversampling: \n', y.groupby(y).count())
# fit_sample was renamed fit_resample in imbalanced-learn 0.4
X_balance, y_balance = SMOTE().fit_resample(dummy_X, y)
X_train, X_test, y_train, y_test = train_test_split(X_balance, y_balance, stratify=y_balance, test_size=0.3, random_state=100)
print('After oversampling: \n', y_balance.groupby(y_balance).count())
Before oversampling: 
 STATUS
0    27711
1     4291
Name: STATUS, dtype: int64
After oversampling: 
 STATUS
0    27711
1    27711
Name: STATUS, dtype: int64
Now I use a random forest to model the data. A single decision tree is a weak classifier that easily overfits or underfits the data. A random forest mitigates this by using a bagging strategy: it trains many trees on random subsets of the data and features and aggregates their votes.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=30, min_samples_leaf=5)
model.fit(X_train,y_train)
y_predict_train=model.predict(X_train)
y_predict=model.predict(X_test)
print('Accuracy of training dataset: ', accuracy_score(y_train, y_predict_train))
print('Accuracy of test dataset: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of training dataset: \n', confusion_matrix(y_train,y_predict_train))
print('Confusion matrix of test dataset: \n', confusion_matrix(y_test,y_predict))
ax=sns.heatmap(confusion_matrix(y_train,y_predict_train),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of training dataset')
plt.show()
ax=sns.heatmap(confusion_matrix(y_test,y_predict),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of test dataset')
plt.show()
Accuracy of training dataset:  0.8581518236886196
Accuracy of test dataset:  0.8373729476153244
Confusion matrix of training dataset: 
 [[16903  2494]
 [ 3009 16389]]
Confusion matrix of test dataset: 
 [[7076 1238]
 [1466 6847]]
Empirical parameters might not be the best choice for a model, so I use grid search cross-validation to find the best combination of parameters. Here I chose three parameters to optimize: n_estimators, min_samples_leaf, and max_depth. Every combination in the grid is evaluated with cross-validation, and the model with the highest score is returned.
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
param_test1 = {'n_estimators':range(20,101,10), 'min_samples_leaf':range(2,20,2), 'max_depth':range(10,100,5)}
# the 'iid' parameter was deprecated in sklearn 0.22 and removed in 0.24, so it is omitted
gsearch1 = GridSearchCV(estimator=RandomForestClassifier(max_features='sqrt',
                                                         random_state=10),
                        param_grid=param_test1, scoring='roc_auc', cv=5, n_jobs=4)
gsearch1.fit(X_train,y_train)
GridSearchCV(cv=5,
estimator=RandomForestClassifier(max_features='sqrt',
random_state=10),
n_jobs=4,
param_grid={'max_depth': range(10, 100, 5),
'min_samples_leaf': range(2, 20, 2),
'n_estimators': range(20, 101, 10)},
scoring='roc_auc')
gsearch1.best_estimator_ , gsearch1.best_params_, gsearch1.best_score_
(RandomForestClassifier(max_depth=40, max_features='sqrt', min_samples_leaf=2,
random_state=10),
{'max_depth': 40, 'min_samples_leaf': 2, 'n_estimators': 100},
0.9210556113675807)
y_predict_train=gsearch1.predict(X_train)
y_predict=gsearch1.predict(X_test)
print('Accuracy of training dataset: ', accuracy_score(y_train, y_predict_train))
print('Accuracy of test dataset: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of training dataset: \n', confusion_matrix(y_train,y_predict_train))
print('Confusion matrix of test dataset: \n', confusion_matrix(y_test,y_predict))
Accuracy of training dataset:  0.8812991364866607
Accuracy of test dataset:  0.8574607566007096
Confusion matrix of training dataset: 
 [[17144  2253]
 [ 2352 17046]]
Confusion matrix of test dataset: 
 [[7148 1166]
 [1204 7109]]
Basically, I used Python and Jupyter Notebook for all the data analysis, and the final report is presented as a notebook.
After data exploration, I found some problems in the dataset, so I used pandas to clean the data and re-encoded the whole dataset. After cleaning, all fields are categorical variables, which are not linear, so I chose non-linear machine learning methods, decision tree and random forest, to solve the problem.
The simple decision tree cannot fit the data well because the original dataset is extremely imbalanced, so I used the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the dataset. I also used a random forest, based on a bagging strategy, to avoid overfitting and underfitting.
Usually, empirical parameters might not be the best choice for a model, so I used grid search cross-validation to find the best combination of parameters. I chose three parameters to optimize: n_estimators, min_samples_leaf, and max_depth. Every combination in the grid is tested, and the model with the highest score is returned.
Finally, the random forest reaches an accuracy of 0.88 on the training dataset and 0.85 on the test dataset, which is better than the decision tree.